1 Fighting the Space Explosion

With the increase in complexity and degree of parallelism of computer systems, it has become even more
important to develop formal methods for ensuring their quality. Correctness and reliability have become
a must-have for business success, and therefore various techniques for automated and semi-automated
formal verification and analysis have been designed and successfully applied. Formal verification and
analysis bring many benefits, such as early integration into the design process and more effective
detection of logic errors. Even though the introduction of formal analysis is rather costly, it pays off,
as it results in a significant reduction in verification time as well as in development costs and
time-to-market. Attempts are being made to integrate formal verification techniques and tools with other
design approaches to support the engineering of complex industrial systems. The iFEST Artemisia
project [59] is an example of a promising tool integration in the embedded systems domain.

Model checking is a distinguished technique for the formal verification of complex hardware and
software designs. The founders of the technique, Edmund M. Clarke Jr. (CMU, USA), E. Allen Emerson
(Texas at Austin, USA), and Joseph Sifakis (IMAG Grenoble, France), received the ACM Turing Award in
2007 for their roles in developing model checking into a highly effective verification technology, widely
adopted in the hardware and software industries. Unfortunately, the model checking procedure is
computationally demanding and memory-intensive in general; hence its applicability to the large and
complex systems routinely seen in practice is still limited. The major hampering factor is the state
space explosion problem, due to which large industrial models cannot be handled efficiently unless more
sophisticated and scalable methods are used.

A lot of attention has been paid to the development of approaches to fight the state space explosion
problem OTl in the field of automated formal verification fTTl. Many techniques, such as state
compaction [47], compression [56], state space reduction [76, 36, 44], and symbolic state space
representation [29], were introduced to reduce the memory needed to handle the verification
problem with a standard sequential software tool. These techniques allowed system developers to verify
larger systems without the need for increased computing power.

To verify even larger systems, however, no option was left other than to employ the combined computing
power of multiple computing devices. Unfortunately, some verification techniques cannot preserve their
efficiency if adapted to non-sequential models of computation, and therefore an urgent need for new and
quite different verification procedures emerged. Many new techniques have been introduced. Some of
them are applicable across a broad range of computing platforms; others are tailored to the specific
capabilities of a particular hardware architecture. Examples include techniques that fight memory
limits through efficient utilization of external memory devices [80], techniques that introduce
cluster-based algorithms to employ the aggregate power of network-interconnected computers
[79, 73, 46, 41], and techniques that speed up the verification process on multi-core processors
ll58l[mr7Hl.

However, back at the beginning of the 21st century, many of these techniques were yet to be
discovered. Even at that time, the idea of using combined resources to increase computational power
was far from new in formal verification. Attempts to use hard drives or parallel computers for the
verification of large systems appeared in the very early years of the automated formal verification
era. However, the inaccessibility of cheap parallel computers with sufficiently fast external memory
devices, together with negative theoretical complexity results, excluded these approaches from the
mainstream of formal verification. Moreover, thanks to Moore's law, the performance of software tools
kept improving for years as the power of single-core CPUs grew. The situation changed dramatically
with the arrival of multi-core CPU chips. The progress in computer design over the past decades had
spanned several orders of magnitude with respect to physical parameters such as power consumption,
efficiency, physical size, and cost. As a result, it became more efficient for chip producers to place
multiple CPU cores on a single chip than to increase the speed of a single core. As the speed of a
single core virtually stopped growing, software built upon a serial algorithm could no longer take
advantage of technological progress. The focus of the parallel and distributed-memory computing
community shifted away from unique massively parallel systems competing for world records towards
smaller and more cost-effective systems built from small and cheap personal computer parts. Suddenly,
the need for parallel processing became general and widespread in all fields of science that rely on
complex computation, automated formal verification being no exception. As a result, interest in
platform-dependent formal verification has been revived.

2 The DiVinE Story

DiVinE [15, 18] is a tool for LTL model checking and reachability analysis of discrete distributed
systems. The tool is able to efficiently exploit the aggregate computing power of multiple
network-interconnected multi-core workstations in order to deal with extremely large verification tasks.
As such, it allows the analysis of systems whose size is far beyond that of systems that can be handled
with regular sequential tools. DiVinE follows the explicit-state automata-based approach to LTL model
checking. Due to Vardi and Wolper [81], the LTL model checking problem reduces to the emptiness
problem for Büchi automata, and hence to the problem of accepting cycle detection in the underlying
directed graph of a Büchi automaton.
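
To make the reduction concrete, the following minimal Python sketch decides whether a directed graph
contains a reachable cycle through an accepting state, using the classical sequential nested
depth-first search. It is an illustration only and not DiVinE code: the dictionary-based graph
encoding and the names has_accepting_cycle, graph, initial, and accepting are assumptions made for
this example, and the recursive formulation is suitable for small graphs only.

```python
def has_accepting_cycle(graph, initial, accepting):
    """Return True if an accepting state lies on a cycle reachable from `initial`.

    graph:     dict mapping each state to a list of successor states (assumed encoding)
    accepting: set of accepting states of the Buchi automaton graph
    """
    outer_visited = set()
    inner_visited = set()  # shared by all nested searches, as in the classical algorithm

    def inner(seed, state):
        # Nested search: look for a path from `state` back to the accepting `seed`.
        for succ in graph.get(state, ()):
            if succ == seed:
                return True
            if succ not in inner_visited:
                inner_visited.add(succ)
                if inner(seed, succ):
                    return True
        return False

    def outer(state):
        outer_visited.add(state)
        for succ in graph.get(state, ()):
            if succ not in outer_visited and outer(succ):
                return True
        # Post-order: start the nested search only after all successors are explored.
        return state in accepting and inner(state, state)

    return outer(initial)


# Hypothetical example: s0 -> s1 -> s2 -> s1, so s1 lies on a cycle but s0 does not.
example = {"s0": ["s1"], "s1": ["s2"], "s2": ["s1"]}
found = has_accepting_cycle(example, "s0", {"s1"})
```

The post-order start of the nested search is what ties this procedure to the depth-first search
stack, which is the property that makes it hard to distribute.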

2.1 Parallel Algorithms in LTL Model Checking 

The need for parallel processing in automated formal verification stemmed from the desire to fight the
state space explosion problem by employing the aggregate memory of multiple network-interconnected
workstations. The crucial aspect studied at first was how to distribute the work among participating
processors in order to take advantage of aggregate memory and parallel processing at the same time.

Based on a parallel algorithm for state space generation [30], a static partitioning scheme relying on
a hash function was introduced [34]. As observed by multiple researchers, hash-based partitioning
yields better space locality if only parts of the state descriptor are used as the input to the
partitioning function. Some approaches required the user of the tool to specify the concrete parts of
the state descriptor to be used for partitioning LMiiZlJ; other approaches employed automated or
semi-automated techniques to do so [35]. By default, DiVinE uses hash-based partitioning over the full
state descriptor. The parts of the descriptor used for partitioning may be statically redefined prior
to compilation. Techniques for load balancing the set of visited states, also known as re-partitioning
techniques, have been suggested 1 21 ElllTOJ, as well as state space generation schemes employing
probabilistic aspects [67, 66]. Nevertheless, none of them have been implemented in DiVinE.
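
A static hash-based partitioning function of the kind described above might look as follows. This is
a hedged sketch rather than the function used in DiVinE: the state is assumed to be a tuple of
integers, CRC32 merely stands in for whatever hash a real tool uses, and the names owner, num_nodes,
and fields are introduced for illustration.

```python
import zlib


def owner(state, num_nodes, fields=None):
    """Return the index of the computation node that owns `state`.

    state:  tuple describing the state (assumed encoding of the state descriptor)
    fields: optional indices of the descriptor components used for partitioning;
            None means that the full descriptor is hashed
    """
    selected = state if fields is None else tuple(state[i] for i in fields)
    # CRC32 is used only because it is deterministic across processes and machines.
    digest = zlib.crc32(repr(selected).encode("utf-8"))
    return digest % num_nodes


# Partitioning on component 0 only: states that differ elsewhere share an owner.
same_owner = owner((1, 2, 3), 4, fields=(0,)) == owner((1, 9, 9), 4, fields=(0,))
```

Hashing only selected components keeps states that differ in the remaining components on the same
node, which is the source of the improved locality mentioned above.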

When the development of the DiVinE model checker started, several tools already existed. The first
known public implementation of a distributed-memory tool for the verification of communication
protocols was the parallel implementation of the Murφ tool [39, 79]. The parallel work-flow of Murφ
relied on the standard MPI-like approach to messaging; active messages were later introduced into Murφ
to improve its efficiency H3l. The success story of Murφ was followed by other verification
tools: SPIN El ITU, CADP [46], UPPAAL [22], etc. Distributed-memory state space generation as a
technique of automated formal verification also appeared in the context of Petri nets [34, 55] and
Markov chains [54, 53].

Prior to the DiVinE model checker, the existing distributed-memory parallel tools focused on
state space generation and reachability analysis only. The reason was simple: the lack of parallel
algorithms for accepting cycle detection in the distributed-memory setting. The Nested Depth-First
Search algorithm (Nested DFS) and other algorithms relying on a depth-first search stack cannot be used
in the distributed-memory setting, as the distributed and parallel maintenance of the depth-first
search stack is inefficient iH. Therefore, new parallel and distributed-memory algorithms for accepting
cycle detection had to be introduced with the development of the DiVinE tool. The first implementation
of DiVinE employed the so-called dependency structure to record the reachability relation among
accepting states of a distributed graph and applied the topological sort algorithm [65] to detect the
presence of a self-reachable accepting state. Other parallel algorithms appeared over time, building
upon various ideas: detection of negative cycles (NEGC algorithm) [28, 26], explicit-state
implementation of symbolic SCC hull detection (OWCTY algorithm) [31], value propagation (MAP
algorithm) [27], or an algorithm based on back-level edges as produced by a breadth-first search
procedure (BLEDGE algorithm). These new algorithms differed in theoretical complexity as well as
practical efficiency. After a large experimental evaluation, some of the algorithms were discontinued
in DiVinE. The latest version of DiVinE employs a combination of the MAP and OWCTY algorithms by
default [12]. The table in Figure 1 gives the complexities and on-the-fly abilities of the newly
introduced parallel LTL model checking algorithms.
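
The idea behind SCC hull detection can be illustrated sequentially. The sketch below alternates a
reachability phase (keep only states reachable from accepting candidates) with an elimination phase
(repeatedly drop candidates without a predecessor among the candidates) until a fixpoint is reached;
a non-empty fixpoint indicates an accepting cycle. It is a simplified, single-process rendering under
assumed names (owcty_has_accepting_cycle, eliminate, and a dictionary-based graph) and does not
reproduce the parallel implementation in DiVinE.

```python
def eliminate(graph, states):
    """Repeatedly remove states that have no predecessor inside `states`."""
    indegree = {s: 0 for s in states}
    for s in states:
        for succ in graph.get(s, ()):
            if succ in indegree:
                indegree[succ] += 1
    remaining = set(states)
    worklist = [s for s, degree in indegree.items() if degree == 0]
    while worklist:
        s = worklist.pop()
        remaining.discard(s)
        for succ in graph.get(s, ()):
            if succ in remaining:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    worklist.append(succ)
    return remaining


def owcty_has_accepting_cycle(graph, initial, accepting):
    """Return True if the fixpoint of reachability and elimination is non-empty."""

    def reached_from(seeds, allowed=None):
        # States reachable from `seeds` by at least one edge, staying inside `allowed`.
        seen, stack = set(), list(seeds)
        while stack:
            state = stack.pop()
            for succ in graph.get(state, ()):
                if succ not in seen and (allowed is None or succ in allowed):
                    seen.add(succ)
                    stack.append(succ)
        return seen

    candidates = reached_from([initial]) | {initial}
    previous = None
    while candidates != previous:
        previous = candidates
        seeds = [s for s in candidates if s in accepting]
        candidates = reached_from(seeds, allowed=candidates)  # reachability phase
        candidates = eliminate(graph, candidates)             # elimination phase
    return bool(candidates)
```

Neither phase depends on a depth-first search order, which is why this style of algorithm lends
itself to the partitioned, message-passing setting.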

Distributed-memory processing cannot fight the state space explosion problem alone and must be
combined with other techniques. One of the most successful techniques for fighting state space
explosion in explicit-state model checking is Partial Order Reduction fTSl. DiVinE is able to perform
this reduction; however, a new topological sort proviso had to be developed in order to maintain the
efficiency of parallel and distributed-memory processing [13]. Another important algorithmic
improvement relates to the classification of LTL formulas [32]. For some classes of LTL formulas (weak
LTL), the parallel algorithms may be significantly improved [7]. With this observation, the OWCTY
algorithm can be improved so that its complexity matches that of the optimal sequential Nested DFS
algorithm; see the table in Figure 1. However, this algorithm suffers from not being an on-the-fly
algorithm. Since on-the-fly verification is an important practical aspect, we have devised a
modification of this algorithm that allows for on-the-fly verification in most verification
instances [12].

While DiVinE focuses on "complete" verification, parallel distributed-memory "incomplete"
verification due to lossy state compaction has been introduced by the PReach tool [24].

Figure 1: Overview of complexities and on-the-fly processing ability of Nested DFS and parallel
algorithms for accepting cycle detection.

2.2 Algorithm Engineering for Parallel LTL Model Checking 

There is no doubt that without an appropriate parallel algorithm, the LTL model checking procedure
cannot be successfully adapted to contemporary parallel computing platforms. Nevertheless, the
algorithm is not the only ingredient required. Even algorithms that are the best in theory may not
outperform good-but-not-optimal algorithms equipped with platform-aware heuristics. This observation
applies even more to parallel processing, where scalability and absolute runtime reduction are
typically valued more than theoretical optimality. To that end, there is another ingredient behind the
development of the parallel and distributed-memory tool DiVinE: algorithm engineering.

Efforts must be made to ensure that promising algorithms discovered by the theory commu- 
nity are implemented, tested and refined to the point where they can be usefully applied in 
practice. 

[Aho et al. [1997], Emerging Opportunities for Theoretical Computer Science] 

In other words, the characteristics of individual computing platforms must be taken into account in
order to obtain efficient implementations on these platforms. To take advantage of the processing
power that various platforms provide, algorithm and data structure implementations need to be
platform-dependent and platform-aware.

2.2.1 Parallelism in Distributed-Memory

The distributed-memory parallel platform was the first platform that the DiVinE tool was adapted to.
The intention was to aggregate the computational power and distributed system memory of multiple
network-interconnected workstations (clusters) in order to facilitate the verification of large model
checking instances fT?, 141. The general idea of employing the distributed-memory platform for the
execution of a parallel graph algorithm was, and still is, as follows. The set of vertices of the
graph to be processed is partitioned among the participating computation nodes using a static
partitioning function. When a computation node processes a vertex, it enumerates all its immediate
successors and checks them for their ownership. If a newly generated vertex is local according to the
partitioning function, it is pushed to the local queue, where it waits for further processing.
Otherwise, a network message containing the vertex is created and sent to the queue of the owning
computation node. With this work-flow, a message is generated for every edge connecting vertices from
different partitions of the graph. This is where the theory ends; when it comes to the implementation,
however, there are still numerous design choices to be made. Some of them are detailed for individual
computing platforms in the following subsections.

Figure 2: DiVinE performance: optimized and original implementation of the OWCTY algorithm.
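
The partition-and-forward work-flow described above can be simulated in a single process. The
following Python sketch is only an illustration under stated assumptions: the nodes are modelled as
in-memory queues rather than networked machines, CRC32 stands in for the partitioning function, and
the name simulate_distributed_reachability is hypothetical. It counts one message per traversed edge
whose endpoints belong to different partitions.

```python
from collections import deque
import zlib


def simulate_distributed_reachability(graph, initial, num_nodes):
    """Explore `graph` with per-node queues and visited sets; count remote messages."""

    def owner(state):
        # Static partitioning function (an assumed stand-in for the real hash).
        return zlib.crc32(repr(state).encode("utf-8")) % num_nodes

    queues = [deque() for _ in range(num_nodes)]
    visited = [set() for _ in range(num_nodes)]
    messages = 0

    queues[owner(initial)].append(initial)
    while any(queues):  # stop once every node's queue is empty
        for node in range(num_nodes):
            while queues[node]:
                state = queues[node].popleft()
                if state in visited[node]:
                    continue
                visited[node].add(state)
                for succ in graph.get(state, ()):
                    target = owner(succ)
                    if target != node:
                        messages += 1  # the successor would travel over the network
                    queues[target].append(succ)
    return visited, messages
```

Each state is stored by exactly one node, so the aggregate memory of all nodes bounds the size of the
explored graph, while the message counter makes the communication cost of a given partitioning
visible.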

Message aggregation and buffering are standard techniques in parallel computing for alleviating
network communication overhead. Therefore, DiVinE maintains buffers of messages to be sent to
individual network-attached computing nodes. In the first implementation of DiVinE, a buffer was
flushed (its messages were sent to the network) in any of the following situations: 1) the buffer was
explicitly flushed by the executed graph algorithm, 2) the maximal number of messages that fit in the
buffer was reached, 3) the local computing node was (otherwise) idle, or 4) the messages in the buffer
were too old. A thorough experimental evaluation, however, showed that the fourth condition is
completely ineffective in terms of network flow, while checking it is quite expensive. After the
fourth flushing rule was dropped, DiVinE gained significantly in performance.
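
The three flushing rules that remained can be sketched as a small buffer class. This is a
hypothetical illustration, not DiVinE's interface: the class name SendBuffer, the capacity parameter,
and the send callback are assumptions, and a real implementation would hand the batch to the
messaging layer instead of a Python callable.

```python
class SendBuffer:
    """Aggregates messages for one destination node and sends them in batches."""

    def __init__(self, capacity, send):
        self.capacity = capacity  # maximal number of messages per batch (assumed)
        self.send = send          # callable that receives a list of messages
        self.pending = []

    def add(self, message):
        self.pending.append(message)
        if len(self.pending) >= self.capacity:  # rule 2: the buffer is full
            self.flush()

    def flush(self):
        # Rule 1: explicit flush requested by the graph algorithm.
        if self.pending:
            self.send(self.pending)
            self.pending = []

    def on_idle(self):
        # Rule 3: the local node has nothing else to do.
        self.flush()


# Hypothetical usage: collect the batches in a list instead of sending them.
sent = []
buffer = SendBuffer(capacity=3, send=sent.append)
for vertex in (1, 2, 3, 4):
    buffer.add(vertex)
buffer.on_idle()
```

Note that no timestamps are kept: dropping the age-based rule removes both the bookkeeping and the
periodic check that the evaluation found to be expensive.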

There were other distributed-memory performance bugs in earlier versions of DiVinE, for example
uncontrolled polling for incoming network messages, massive flushing of all buffers at the same time,
or insufficient separation of the initialization and computation phases. For more details see [82].
The cumulative effect of eliminating these bugs from DiVinE is shown in Figure 2.

2.2.2 Parallelism in Shared-Memory 

Most techniques and results known from the distributed-memory setting are straightforwardly applicable
to shared-memory architectures. The DiVinE architecture follows this observation, which means that if
DiVinE is executed on a multi-core machine with shared memory, it mimics distributed-memory behavior.
In particular, the graph to be processed is partitioned among the individual parallel shared-memory
threads in the same way as it would be in the distributed-memory setting. Each thread maintains its
own hash table and its own pool of vertices to be processed. Vertices belonging to other threads are
pushed to those threads' local pools by means of lock-free shared-memory queues [10]. The relative
advantages and disadvantages of shared versus private hash tables within the context of thread-private
pools of vertices to be processed have been discussed in [21]. These approaches were evaluated, both
theoretically and practically, in a prototype implementation ifTTTl.

Nevertheless, the scalability of parallel distributed-memory solutions on shared memory is often
limited. Therefore, shared-memory-specific techniques are needed to improve the efficiency and
scalability of existing parallel distributed-memory solutions on shared-memory architectures. Examples
of successful shared-memory-specific techniques include shared communication data structures [60, 10],
specific termination detection techniques [Wl, dual-core algorithms [58], or a quite unique
partitioning scheme [57]. As for DiVinE, it seems that the design choice of having thread-private
pools of vertices to be processed was not the best one VJIL. However, an experimental confirmation is
still pending.

2.2.3 Employing External Memory 

Efficient algorithmic usage of computing devices with memory hierarchies is an established research
topic fJ5t. Numerous algorithms were devised to efficiently utilize external-memory block devices,
such as disks. The efficiency of such an algorithm is typically measured by the number of I/O
(input/output) operations. To that end, I/O complexity has been defined [1] and the standard
breadth-first graph traversal algorithms have been adapted to the I/O setting. The crucial technique
used to do so is the so-called delayed duplicate detection [33], which has been further improved in
[83, 3, 49] and specialized for undirected graphs [68, 69]. Regarding formal verification, graph
traversal algorithms are used for state space generation and verification of safety properties; see,
e.g., the disk extension of the verification tool Murφ [80, 78].
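
Delayed duplicate detection can be illustrated by a layered breadth-first search that postpones all
duplicate checks until a whole layer has been expanded. In the sketch below, sorted Python lists stand
in for files on disk, so that every access is a sequential scan or a merge; the function names and the
in-memory sorting step are assumptions made for the example, whereas a real implementation would use
an external sort.

```python
import heapq


def subtract_sorted(candidates, visited):
    """Return the items of sorted `candidates` absent from sorted `visited` (one scan)."""
    result = []
    i = 0
    for candidate in candidates:
        while i < len(visited) and visited[i] < candidate:
            i += 1
        if i == len(visited) or visited[i] != candidate:
            result.append(candidate)
    return result


def bfs_delayed_duplicate_detection(graph, initial):
    """Return the BFS layers of `graph`, checking for duplicates once per layer."""
    visited = [initial]  # sorted list standing in for the visited-states file
    frontier = [initial]
    layers = [[initial]]
    while frontier:
        # 1. Expand the whole layer without any lookups in the visited set.
        candidates = [succ for state in frontier for succ in graph.get(state, ())]
        # 2. Sort the candidates and drop duplicates within the layer.
        candidates = sorted(set(candidates))
        # 3. Remove already visited states in a single merge-like pass.
        frontier = subtract_sorted(candidates, visited)
        visited = list(heapq.merge(visited, frontier))
        if frontier:
            layers.append(frontier)
    return layers
```

Because the visited set is touched only once per layer and only sequentially, the number of I/O
operations depends on the number of layers and the size of the files rather than on the number of
individual state lookups.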

For problems beyond state space generation, breadth-first search graph traversal algorithms are
unsuitable. Therefore, the first approach to LTL model checking with an external memory device
employed a generic reduction of the LTL model checking problem to the reachability problem [23].
Unfortunately, such a reduction results in quadratic growth of the memory demands, which effectively
rules out its application to large-scale industrial cases. Therefore, "incomplete" verification
approaches dominated the research field for some time. These include random walk strategies [64],
iterative deepening and A* algorithms [61, 62], and breadth-first search based approaches with a
limited amount of stored information [72].

The I/O branch of DiVinE was started with the invention of a new I/O algorithmic technique that
efficiently avoids the quadratic space overhead [19]. The new approach was further improved by the
introduction of so-called merge omissions [20], which allowed for more efficient delayed duplicate
detection in the later stages of the computation. Various formulas for controlling what should be
omitted were introduced fl31; however, they were not implemented within the I/O branch of DiVinE. A
completely different technique for trading time for space, employing perfect hashing, has been
implemented in the I/O branch of DiVinE. This technique is referred to as the semi-external approach
to the LTL model checking problem [40].

2.2.4 Many-Core Parallelism 

After NVIDIA's CUDA technology [38] was introduced, many computationally demanding tasks have
been accelerated by GPU-aware algorithms. Examples of GPU-accelerated procedures include, but are
not limited to, sorting [48], reduce operations [52], and numerous biological and physical
simulations, such as protein folding [63]. As for graph theory, successful adaptations of general
graph traversal algorithms have been reported too [50, 51], demonstrating the tremendous computational
power of CUDA devices. On the other hand, graphs to be explored efficiently with a CUDA-accelerated
algorithm must be encoded explicitly in a compact way.
CUDA technology as a computing platform has also attracted researchers in the field of automated
formal verification. The key challenge, for which no satisfactory solution is known yet, is how to
accelerate the generation of the explicitly encoded state space graph from its implicit definition.
Preliminary attempts to do so relate to explicit model checking. They suggest employing a massively
parallel check for enabled transitions emanating from the vertices on the frontier of the search,
followed by their massively parallel execution I?ni42l.

Once the state space is generated and explicitly represented in an appropriate sparse-matrix-like
structure, many verification tasks can be accelerated using CUDA technology. This has been
successfully demonstrated, e.g., on the verification of probabilistic systems [25] and on LTL model
checking [17]. The latest developments in the DiVinE CUDA tool [16] allow for efficient utilization of
multiple CUDA devices [5] and acceleration of the detection of strongly connected components [6].
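
One common compact, sparse-matrix-like graph encoding is compressed sparse row (CSR): two flat
integer arrays that can be copied to a device in one piece. The Python sketch below builds such an
encoding on the CPU purely to show the layout. Whether the DiVinE CUDA tool uses exactly this format
is not stated here, so CSR and the names to_csr, successors, row_offsets, and column_indices should
be read as assumptions.

```python
def to_csr(num_vertices, edges):
    """Encode a directed graph as CSR arrays (row_offsets, column_indices).

    Vertices are assumed to be numbered 0 .. num_vertices - 1; `edges` is a list of
    (source, target) pairs.
    """
    row_offsets = [0] * (num_vertices + 1)
    for source, _ in edges:
        row_offsets[source + 1] += 1          # count outgoing edges per vertex
    for vertex in range(num_vertices):
        row_offsets[vertex + 1] += row_offsets[vertex]  # prefix sums give offsets

    column_indices = [0] * len(edges)
    cursor = list(row_offsets[:-1])           # next free slot for each vertex
    for source, target in edges:
        column_indices[cursor[source]] = target
        cursor[source] += 1
    return row_offsets, column_indices


def successors(row_offsets, column_indices, vertex):
    """Return the successors of `vertex` as one contiguous slice."""
    return column_indices[row_offsets[vertex]:row_offsets[vertex + 1]]


# Hypothetical example graph with three vertices and four edges.
offsets, columns = to_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Since the successors of a vertex occupy a contiguous slice, threads that process neighbouring edges
read neighbouring memory locations, which is the kind of access pattern that such devices favour.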



3 Summary 

Platform-dependent verification is an alternative approach to making automated formal verification
attractive for industry. Despite significant progress in the development of various specific
techniques and tools on the algorithmic level, mainly for parallel architectures, there is still a gap
between pseudo-code and implementation. Implementations must be tuned for specific platforms; for
example, memory access patterns seem to play a crucial role. In platform-dependent verification, we
should learn to appreciate engineering solutions.